Abstract: A common question asked by users of direct search algorithms is
how to use derivative information at iterates where it is available. This paper
addresses that question with respect to Generalized Pattern Search (GPS) methods
for unconstrained and linearly constrained optimization. Specifically, this
paper  concentrates  on  the  GPS  POLL  step.  Polling  is  done  to  certify  the  need  to 
refine  the  current  mesh,  and  it  requires  O(n)  function  evaluations  in  the  worst 
case.  We  show  that  the  use  of  derivative  information  significantly  reduces  the 
maximum  number  of  function  evaluations  necessary  for  POLL  steps,  even  to  a 
worst  case  of  a  single  function  evaluation  with  certain  algorithmic  choices  given 
here.  Furthermore,  we  show  that  rather  rough  approximations  to  the  gradient  are 
sufficient  to  reduce  the  poll  step  to  a  single  function  evaluation.  We  prove  that 
using  these  less  expensive  POLL  steps  does  not  weaken  the  known  convergence 
properties  of  the  method,  all  of  which  depend  only  on  the  POLL  step. 

Key words: Pattern search algorithm, linearly constrained optimization, surrogate-based
optimization, nonsmooth optimization, gradient-based optimization.


1  Introduction 

We  consider  in  this  paper  the  class  of  Generalized  Pattern  Search  (GPS)  algorithms  applied 
to  the  optimization  problem 

$$\min_{x \in X} f(x) \qquad (1)$$

where $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ and where $X \subseteq \mathbb{R}^n$ is defined, as in [8], by a finite set
of linear inequalities: $X = \{x \in \mathbb{R}^n : Ax \le b\}$, where $A \in \mathbb{Q}^{m \times n}$ is a rational matrix
and $b \in (\mathbb{R} \cup \{\infty\})^m$. The way of handling constraints here is the same as in [22) 2M IB].
Specifically, we apply the algorithm not to $f$ itself, but to the function $f_X = \psi_X + f$, where
$\psi_X$ is the indicator function for $X$: it is zero on $X$ and $\infty$ elsewhere. Thus, when $f_X$ is
evaluated outside of the domain $X$, the function $f$ is not evaluated and $f_X$ is set to $\infty$.
Also, when a local exploration of the variable space is performed near the boundary of $X$,
the directions that define the exploration are related to the local geometry of the boundary
of the domain.
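
As an illustration only, the barrier construction described above can be sketched in a few lines of Python. The helper name `make_barrier`, the list-based representation of $A$ and $b$, and the evaluation counter are assumptions made for this sketch and do not come from the paper.

```python
import math

def make_barrier(f, A, b):
    """Return f_X = psi_X + f for X = {x : A x <= b} (hypothetical helper).

    f is never called at points outside X, mirroring the text.
    """
    calls = {"f": 0}  # counts true evaluations of f, for illustration only

    def f_X(x):
        for row, rhs in zip(A, b):
            if sum(a * xi for a, xi in zip(row, x)) > rhs:
                return math.inf  # psi_X(x) = inf, so f is skipped
        calls["f"] += 1
        return f(x)

    return f_X, calls

# Example: a quadratic restricted to the box 0 <= x_i <= 1 in R^2.
A = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [1, 1, 0, 0]
f_X, calls = make_barrier(lambda x: (x[0] - 2) ** 2 + x[1] ** 2, A, b)
inside = f_X([0.5, 0.5])    # feasible point: f is evaluated
outside = f_X([1.5, 0.5])   # infeasible point: f is not evaluated
```

Only the feasible point triggers an evaluation of the underlying function, which is the source of the savings discussed later for linearly constrained problems.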

GPS algorithms can be applied to these problems, particularly when $f$ is quite expensive
to evaluate, and GPS can also be used in the surrogate management framework [10, 11]. We
cannot justify further global assumptions about the class of problems [10] we target, but
fortunately, the Clarke [13] calculus enables a simple unified convergence analysis by allowing
$f_X$ to be extended valued (see 0). GPS algorithms are direct search algorithms: they
rely only on function values. We like to view each iteration of a GPS algorithm as divided
into two steps (as presented in [10]): a global exploration of the variable space, meant to identify
promising basins containing local optima, called the SEARCH step, and a local exploration, meant to
refine the solution, called the POLL step. Both of these steps are fully described in Sections 3.1
and 3.2, but this paper is concerned only with modifications to the POLL step. Our limited


discussions  of  SEARCH  steps  are  only  in  the  context  of  being  used  with  one  of  the  modified 
POLL  steps  suggested  here,  and  their  purpose  is  to  alleviate  any  concern  a  reader  might  feel 
about  our  poll  steps  being  too  much  like  a  crude  gradient  descent  method. 


Derivative-free GPS algorithms were defined by Torczon ra for unconstrained optimization,
and then extended by Lewis and Torczon to bound constrained optimization [23] and
to problems with a finite number of linear constraints [23]. Lewis and Torczon [22] present a
pattern search algorithm for general constraints through a sequence of augmented Lagrangian
subproblems. Audet and Dennis extend GPS algorithms to general constraints [7] using a
filter-based na approach, and they extend GPS to categorical optimization variables [6] as
well. The latter are discrete variables that arise often in engineering design but cannot be
treated by the common branch and bound techniques (e.g., see [20] |TJ, where engineering
applications with categorical variables are solved by a GPS algorithm).


In  this  paper,  we  add  to  the  GPS  algorithm  class  the  possibility  of  using  any  derivative 
information  for  the  objective  function  that  might  be  available  at  an  iterate.  Our  objective  is 
to  decrease  the  worst  case  cost  of  a  poll  step  at  that  iterate.  To  test  our  proposals,  we  give 
some  numerical  results  with  empty  search  steps  to  compare  poll  steps  with  and  without 
derivative  information.  These  results  should  not  be  interpreted  as  saying  anything  about  the 
efficiency  or  robustness  of  GPS  algorithms  in  comparison  with  other  direct  search  methods. 
We  emphasize  that  their  only  purpose  is  to  test  our  ideas  for  GPS  poll  steps. 


We  will  not  discuss  using  derivative  information  in  the  search  step,  although  that  is 
an  interesting  direction  for  further  research.  The  algorithm  presented  here  reduces  to  the 
previous  GPS  versions  when  derivatives  are  not  available.  Abramson  [3]  synthesizes  all  this 
into  a  GPS  algorithm  for  generally  constrained  mixed  variable  problems  where  derivative 
use  is  optional. 


We are aware that users may choose direct search methods even for problems where
$\nabla f(x)$, or a finite difference approximation, might be available and reliable. They are willing
to accept more function evaluations in hopes of finding a better local minimum, since gradient-based
algorithms have a tendency to converge quickly to a local minimizer in the same basin
as the starting point. In the tests given here for our gradient-informed POLL steps versus
standard GPS POLL steps, the derivative and non-derivative versions using the same positive
spanning set are equally likely to find a better local minimizer. We provide one example to
illustrate the utility of a simple SEARCH in finding a better minimizer than would be found
with no SEARCH. Our example supports the claim that locality is unlikely to be a problem
in practice if a SEARCH procedure of the type suggested in Section 3.1 were used.


Lewis and Torczon iza began the quest to reduce the worst case number of function calls
per GPS iteration. They showed under global continuous differentiability of $f$ that the set of
directions used in the POLL step can be composed of as few as $n + 1$ directions if their nonnegative
linear combinations span the whole space, and still ensure subsequence convergence to
a stationary point. In this paper, we show how to use gradient information to further reduce
the worst case cost to one function evaluation per iteration in the unconstrained case, and to
achieve a related reduction in the linearly constrained case without giving up perceived GPS
advantages over gradient-based methods. Indeed, in the linearly constrained case, we may
require no function evaluations and no additional gradient evaluations for some iterations.
Note that if one computes gradient approximations by finite differences, then the linearly
constrained case is the only one likely to see any savings from the use of such an expensive
gradient. However, for many engineering problems, inexpensive but inexact gradient
information can be available, and that is all we need for our modifications (see [SUEZ]).

Our results do not assume that derivative information is used at every iterate, nor do
we assume that derivative information in all directions is used. At an iterate where $p \le n$
approximate partial derivatives are used, we can reduce the worst case cost of a POLL step
from $n + 1$ to $n + 1 - p$. Thus, if even a rough approximation to the full space gradient is
available, the POLL step reduces to 1 evaluation of $f_X$. Of course, this may or may not be
a worthwhile savings depending on the cost of computing the derivative information at an
iterate. However, we suspect it will be especially worthwhile near convergence, when our
POLL steps generate points close enough to the current iterate that they mimic gradient
descent steps and are surely not expected to find a different basin.

Whether  we  have  derivative  information  or  not,  polling  directions  considered  by  the  new 
algorithm must still conform to the geometry of $X$, as proposed in [21], i.e., they must contain
the  generators  of  the  tangent  cone  at  all  nearby  boundary  points  (a  sufficient  condition  for 
this  property  is  the  rationality  of  the  matrix  A).  If  the  directions  do  not  conform  to  the 
boundary,  then  the  iterates  may  stagnate  at  a  non-optimal  point  even  for  a  linear  program. 
This  leads  to  very  satisfying  optimality  conditions  for  a  finite  number  of  linear  constraints 
under local smoothness assumptions on $f$ (see 0). These conditions are unlikely to be
realized  for  more  general  constraints,  the  key  restriction  being  that  there  must  be  a  single 
finite  set  containing  generators  for  all  the  tangent  cones  to  the  boundary  of  the  feasible 
region. 

The  remainder  of  the  paper  is  as  follows:  In  the  next  section,  we  give  a  brief  collection 
of  information  about  positive  spanning  sets.  Next,  we  give  a  brief  description  of  the  GPS 
algorithm  class  and  modify  it  to  make  use  of  available  derivatives  in  the  POLL  step.  Then, 
we propose in Section 4 a way to select polling directions that reduce as much as possible the
number of function evaluations at each iteration. We also discuss the case of incomplete and
inaccurate derivative information, such as when only a subset of the partial derivatives can
be computed. In Section 5, we show that the key convergence result for the GPS algorithm
still holds when it is modified to use any available derivatives. In Section 6, we use some
unconstrained and bound constrained problems from the CUTE test set [9] to test reasonable
variants of our approach in a GPS algorithm with an empty SEARCH step. To illustrate the
value of even a simple SEARCH step, we give an academic two-dimensional example in which
a rudimentary random SEARCH improves the quality of the solution. The paper concludes
with a discussion of how derivative information could be used in a GPS algorithm that
includes a SEARCH step.

Notation. $\mathbb{R}$, $\mathbb{Z}$, and $\mathbb{N}$ denote the sets of real numbers, integers, and nonnegative
integers, respectively, and $N$ denotes the set $\{1, 2, \ldots, n\}$. For $i \in N$, $e_i$ denotes the $i$th
coordinate vector in $\mathbb{R}^n$. For $x \in X \subset \mathbb{R}^n$, the tangent cone to $X$ at $x$ is given by
$T_X(x) = \mathrm{cl}\{\mu(w - x) : \mu \in \mathbb{R}_+, w \in X\}$, where $\mathrm{cl}$ denotes the closure, and the normal cone $N_X(x)$ to
$X$ at $x$ is the polar of the tangent cone, namely, $N_X(x) = \{v \in \mathbb{R}^n : v^T w \le 0 \ \forall\, w \in T_X(x)\}$.
The normal cone is the nonnegative span of all outwardly pointing constraint normals at $x$.


2  Positive  Spanning  Sets 

Since the GPS framework and theory rely on the use of positive spanning sets, it is useful
to briefly define some terminology and cite some important results.

The following terminology and key theorem are due to Davis [15]. The theorem is the
motivation behind the use of positive spanning sets in GPS algorithms m.

Definition 2.1 A positive combination of the set of vectors $D = \{d_i\}_{i=1}^r$ is a linear combination
$\sum_{i=1}^r \alpha_i d_i$, where $\alpha_i \ge 0$, $i = 1, 2, \ldots, r$.

Definition 2.2 A finite set of vectors $D = \{d_i\}_{i=1}^r$ forms a positive spanning set for $\mathbb{R}^n$ if
every $v \in \mathbb{R}^n$ can be expressed as a positive combination of vectors in $D$. The set of vectors
$D$ is said to positively span $\mathbb{R}^n$. The set $D$ is said to be a positive basis for $\mathbb{R}^n$ if no proper
subset of $D$ positively spans $\mathbb{R}^n$.

Davis [15] shows that the cardinality of any positive basis (i.e., a minimal positive spanning
set) in $\mathbb{R}^n$ is between $n + 1$ and $2n$. Common choices for positive bases include the
columns of $[I, -I]$ and $[I, -e]$, where $I$ is the identity matrix and $e$ the vector of ones. In
fact, for any basis $V \in \mathbb{R}^{n \times n}$, the columns of $[V, -V]$ and $[V, -Ve]$ are easily shown to be
positive bases.
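
To make the notion of a positive combination concrete, the following Python sketch writes an arbitrary vector as a nonnegative combination of the columns of $[I, -e]$. The function name `coefficients_minimal_basis` and the shifting trick are choices made for this illustration, not constructions taken from the paper.

```python
def coefficients_minimal_basis(v):
    """Nonnegative coefficients of v in the positive basis [I, -e].

    Returns (alpha, t) with v = sum_i alpha[i] * e_i + t * (-e).
    """
    t = max(0.0, -min(v))          # shift that makes every entry nonnegative
    alpha = [vi + t for vi in v]   # coefficients on the coordinate vectors
    return alpha, t

v = [3.0, -2.0, 0.5]
alpha, t = coefficients_minimal_basis(v)
rebuilt = [a - t for a in alpha]   # e_i contributes a, -e contributes -t
```

Every coefficient is nonnegative and the combination reproduces $v$, so the $n + 1$ columns of $[I, -e]$ suffice to reach any vector.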

Theorem 2.3 A set $D$ positively spans $\mathbb{R}^n$ if and only if, for all nonzero $v \in \mathbb{R}^n$, $v^T d > 0$
for some $d \in D$.

Proof. (Davis [15]) Suppose $D$ positively spans $\mathbb{R}^n$. Then for arbitrary nonzero $v \in \mathbb{R}^n$,
$v = \sum_{i=1}^{|D|} \alpha_i d_i$ for some $\alpha_i \ge 0$, $d_i \in D$, $i = 1, 2, \ldots, |D|$. Then $0 < v^T v = \sum_{i=1}^{|D|} \alpha_i d_i^T v$, at
least one term of which must be positive. Conversely, suppose $D$ does not positively span
$\mathbb{R}^n$; i.e., suppose that the set $D$ spans only a convex cone that is a proper subset of $\mathbb{R}^n$.
Then there exists a nonzero $v \in \mathbb{R}^n$ such that $v^T d \le 0$ for all $d \in D$. ■

It is easy to see that for any positive spanning set $D$, if $\nabla f(x)$ exists at $x$ and is nonzero,
then by choosing $v = -\nabla f(x)$ in this theorem, there exists a $d \in D$ satisfying $\nabla f(x)^T d < 0$.
Thus at least one $d \in D$ is a descent direction for $f$ at $x$.
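
The remark above can be checked numerically. The sketch below lists the descent directions contained in two common positive bases for a hypothetical gradient value; the helper `descent_directions` and the sample gradient are assumptions of this example.

```python
def descent_directions(D, grad):
    """Directions d in D with grad^T d < 0 (D is a list of vectors)."""
    return [d for d in D if sum(g * di for g, di in zip(grad, d)) < 0]

n = 3
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
D_max = I + [[-x for x in e] for e in I]   # columns of [I, -I]
D_min = I + [[-1] * n]                     # columns of [I, -e]
grad = [0.5, -2.0, 0.0]                    # hypothetical nonzero gradient
found_max = descent_directions(D_max, grad)
found_min = descent_directions(D_min, grad)
```

Both lists are nonempty, as the theorem guarantees for any nonzero gradient.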

3 Description of pattern search algorithms


Pattern search algorithms are defined through a finite set of directions used at each iteration
to create a conceptual mesh on which trial points are selected, evaluated and compared
against the current best solution, called the incumbent. The basic ingredient in the definition
of the mesh is a set of positive spanning directions $D$ in $\mathbb{R}^n$ constructed by the following rule:
each direction $d_j \in D$ (for $j = 1, 2, \ldots, |D|$) is the product $G z_j$ of the non-singular generating
matrix $G \in \mathbb{R}^{n \times n}$ by an integer vector $z_j \in \mathbb{Z}^n$. Note that the same generating matrix is used
for all directions, and is often defined to be the identity matrix. These conditions are essential
to the convergence theory (see [8], where the same notation is used). We conveniently view
the set $D$ as a real $n \times |D|$ matrix.

At iteration $k$, the mesh $M_k$ is centered around the current iterate $x_k \in \mathbb{R}^n$, with its
fineness described by the mesh size parameter $\Delta_k \in \mathbb{R}_+$ as

$$M_k = \{x_k + \Delta_k D z : z \in \mathbb{N}^{|D|}\}. \qquad (2)$$

For any mesh $M_k$, the distance between any two distinct mesh points is bounded below by
$\Delta_k / \|G^{-1}\|$ (see [8], Lemma 3.2).
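
A finite window of the mesh can be enumerated directly from its definition. The sketch below is an illustration under assumptions: it takes $G = I$ with $D = [I, -e]$ in $\mathbb{R}^2$, truncates $z$ to a small box (the parameter `z_max` is hypothetical), and then measures the smallest distance between distinct mesh points, which for this choice equals the mesh size parameter.

```python
from itertools import product
import math

def mesh_points(x_k, delta_k, D, z_max):
    """Mesh points x_k + delta_k * D z for z in {0, ..., z_max}^{|D|}.

    D is a list of direction vectors (the columns of the matrix D).
    Only a finite window of the infinite mesh is enumerated.
    """
    n = len(x_k)
    points = set()
    for z in product(range(z_max + 1), repeat=len(D)):
        step = [sum(zj * d[i] for zj, d in zip(z, D)) for i in range(n)]
        points.add(tuple(x_k[i] + delta_k * step[i] for i in range(n)))
    return points

D = [(1, 0), (0, 1), (-1, -1)]          # columns of [I, -e], so G = I
pts = mesh_points((0.0, 0.0), 0.5, D, z_max=2)
min_dist = min(math.dist(p, q) for p in pts for q in pts if p != q)
```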


3.1  The  SEARCH  step 


At  each  iteration,  the  algorithm  performs  one  or  two  steps,  or  only  part  of  them.  The 
objective  function  is  evaluated  at  a  finite  set  of  mesh  trial  points  with  the  goal  of  improving 
the  incumbent  value.  If  and  when  an  improvement  is  found,  that  trial  point  becomes  the 
new  incumbent  and  the  iteration  is  declared  successful.  The  optional  first  step  is  called  the 
search  step.  Any  strategy  of  searching  a  finite  set  of  mesh  points  can  be  used.  It  can  be 
as simple as doing nothing (i.e., the finite set of mesh points is empty), or it can make use
of heuristics, such as genetic algorithms, or surrogate functions based on interpolation or on
simplified models. See [10] and m to see how this can be done effectively.


At  this  point,  we  wish  to  make  a  distinction  between  two  types  of  search  strategies. 
In one case, the mesh size parameter $\Delta_k$ controls not only the resolution of the mesh, but
the size of the step as well. Such is the case with the PDS or MDS method of Dennis and
Torczon JIB]. In the other case, the SEARCH routine is more or less independent of $\Delta_k$ (other
than ensuring that trial points in fact lie on the mesh). We prefer the latter type because, at
least near the beginning of the run, it allows for a more diversified sampling of the space. For
example, a few points chosen by a classical space-filling sampling technique like the Latin
hypercube [251 28] have led to good local solutions in our experience with the NOMAD
software [2] (e.g., see the simple problem in Example 6.1).


Obviously,  derivative  information  can  be  used  in  the  search  step;  for  example,  one 
might  use  Hermite  interpolants  as  surrogates,  but  that  is  outside  the  scope  of  this  paper. 
Our thesis here is that available derivative information can be used as detailed in the sequel
to  reduce  the  number  of  trial  points  considered  in  the  POLL  step. 


3.2  The  POLL  step 


If  the  SEARCH  step  is  omitted  or  is  unsuccessful  in  improving  the  incumbent  value,  a  second 
step,  named  the  POLL  step,  must  be  conducted  before  terminating  the  iteration.  The  POLL 
step  consists  of  a  local  exploration  of  mesh  points  surrounding  the  incumbent  solution.  A 
global  search  step  as  described  above  is  where  the  global  quality  of  the  final  local  solution 
is  determined  most  effectively. 

When the POLL step is invoked, the objective function must be tested at certain mesh
points, called the poll set, near the current iterate $x_k \in \mathbb{R}^n$. At each iteration, a positive
spanning matrix $D_k$ composed of a subset of columns of $D$ is used to construct the poll set.
We write $D_k \subseteq D$ to signify that the columns of the matrix $D_k$ are columns of $D$. The poll
set $P_k$ is composed of mesh points that lie a step $\Delta_k$ away from the current iterate $x_k$ along
the directions specified by the columns of $D_k$:

$$P_k = \{x_k + \Delta_k d : d \in D_k\}. \qquad (3)$$

Rules for selecting $D_k$ may depend on the iteration number and on the current iterate, i.e.,
$D_k = D(k, x_k) \subseteq D$. We will show in Section 4 how to exploit this freedom when derivatives
are available at iteration $k$.
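
The poll set of equation (3) is straightforward to build. The following minimal sketch assumes directions are stored as tuples; the function name `poll_set` and the sample data are hypothetical.

```python
def poll_set(x_k, delta_k, D_k):
    """Poll points x_k + delta_k * d for each direction d in D_k."""
    return [tuple(xi + delta_k * di for xi, di in zip(x_k, d)) for d in D_k]

x_k = (1.0, 2.0)
D_k = [(1, 0), (0, 1), (-1, -1)]   # a positive basis of R^2 with n + 1 vectors
P_k = poll_set(x_k, 0.25, D_k)
```

Without derivative information, a complete POLL step may have to evaluate the objective at every point of `P_k`, here $n + 1 = 3$ points.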

In order to handle a finite number of linear constraints, we use the following definition
from [8].

Definition 3.1 A rule for selecting the positive spanning sets $D_k = D(k, x_k) \subseteq D$ conforms
to $X$ for some $\varepsilon > 0$, if at each iteration $k$ and for each $y$ in the boundary of $X$ for which
$\|y - x_k\| < \varepsilon$, the nonnegative linear combinations of the columns of a subset $D_y(x_k)$ of $D_k$
generate $T_X(y)$.


This definition means that when $x_k$ is near the boundary of the feasible region $X$, then
$D_k$ must contain the directions that span the tangent cones at all nearby boundary points
(see pM|).

The rules for updating the mesh size parameter are as follows. Let $\tau > 1$ be a rational
number and $w^- \le -1$ and $w^+ \ge 0$ be integers constant over all iterations. At the end of
iteration $k$, the mesh size parameter is set to

$$\Delta_{k+1} = \tau^{w_k} \Delta_k, \quad \text{where } w_k \in \begin{cases} \{0, 1, \ldots, w^+\} & \text{if the iteration is successful,} \\ \{w^-, w^- + 1, \ldots, -1\} & \text{otherwise.} \end{cases} \qquad (4)$$


By modifying the mesh size parameters this way, it follows that, for any $k \ge 0$, there exists
an integer $r_k \in \mathbb{Z}$ such that $\Delta_k = \tau^{r_k} \Delta_0$, and the next iterate $x_{k+1}$ can always be written as
$x_0 + \sum_{i=1}^{k} \Delta_i D z_i$ for some set of vectors $\{z_i\}$, $i = 1, 2, \ldots, k$ in $\mathbb{N}^{|D|}$.
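
The mesh size update can be sketched with exact rational arithmetic, which makes the power-of-$\tau$ structure visible. The function `update_mesh_size`, its default parameter values, and the policy of always taking the extreme admissible exponent are assumptions of this illustration.

```python
from fractions import Fraction

def update_mesh_size(delta_k, success, tau=Fraction(2), w_plus=1, w_minus=-1):
    """Return delta_{k+1} = tau**w_k * delta_k following the update rule.

    Assumption of this sketch: w_k is always taken at the extreme of its
    allowed range (w_plus on success, w_minus on failure).
    """
    assert tau > 1 and w_plus >= 0 and w_minus <= -1
    w_k = w_plus if success else w_minus
    return tau ** w_k * delta_k

delta = Fraction(1)
history = [delta]
for success in [True, False, False, False]:
    delta = update_mesh_size(delta, success)
    history.append(delta)
```

Every entry of `history` is an integer power of $\tau$ times the initial mesh size, which is the property the convergence theory relies on.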

3.3 Generalized pattern search algorithm with a pruned poll set

The  main  difference  between  this  paper  and  previous  work  on  pattern  search  algorithms 
comes  from  the  following.  Whenever  the  derivative  information  for  the  objective  function  at 
the  current  iterate  xk  is  available,  it  can  be  used  to  prune  (or  eliminate  from  consideration) 
ascent  directions  from  the  poll  set,  thus  reducing  the  number  of  function  evaluations  required 
by the POLL step. Because there is some ambiguity, at least in French and English, we remark
here that when we use the term pruned set of directions, or pruned set, we mean what is
left after the pruning operation removes some directions, not the directions that have been
eliminated. The pruned set of polling directions, denoted $D_k^p \subseteq D_k$, is defined as follows.

Definition 3.2 Given a positive spanning set $D_k$, the pruned set of polling directions, denoted
by $D_k^p \subseteq D_k$, is given by

$$D_k^p = \{d \in D_k : d^T \nabla f(x_k) \le 0\}$$

when the gradient $\nabla f(x_k)$ is known, or by

$$D_k^p = D_k \setminus \{d \in V_k : f'(x_k; d) > 0\},$$

when incomplete derivative information is known, where $V_k \subseteq D_k$ is the set of directions in
which the directional derivative is known. In the case where the gradient is known, $-\nabla f(x_k)$
is said to prune $D_k$ to $D_k^p$.
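
A minimal sketch of the pruning operation follows. It keeps a direction unless its directional derivative is known to be strictly positive, which covers the full-gradient case, the partial-information case, and the derivative-free case. The function name `prune`, its arguments, and the dictionary used to represent the known directional derivatives are assumptions of this example.

```python
def prune(D_k, grad=None, known_dir_derivs=None):
    """Pruned set of polling directions.

    grad: full gradient at x_k, or None if unavailable.
    known_dir_derivs: dict mapping a direction (tuple) to f'(x_k; d) for the
        directions where it is known (the set V_k); ignored if grad is given.
    """
    if grad is not None:
        return [d for d in D_k if sum(g * di for g, di in zip(grad, d)) <= 0]
    if known_dir_derivs:
        return [d for d in D_k if known_dir_derivs.get(d, 0.0) <= 0]
    return list(D_k)  # no derivative information: nothing is pruned

D_k = [(1, 0), (0, 1), (-1, 0), (0, -1)]
full = prune(D_k, grad=(1.0, -3.0))
partial = prune(D_k, known_dir_derivs={(1, 0): 1.0, (-1, 0): -1.0})
none = prune(D_k)
```

With the full gradient, two of the four directions survive; with a single known ascent direction, only that direction is removed.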

Note that the pruning operation depends only on $x_k$, $D_k$, and the availability of the
derivative information. It is independent of $\Delta_k$. Note also that if no derivative information
is available, then the pruned set of directions $D_k^p$ equals $D_k$. We can now define the pruned
poll set at iteration $k$ as

$$P_k^p = \{x_k + \Delta_k d : d \in D_k^p\}. \qquad (5)$$

The POLL step of the algorithm evaluates the constraints and objective at points in $P_k^p$
looking for an improvement in $f_X$ at a feasible point. If no improvement is found, then the
incumbent solution $x_k$ is said to be a mesh local optimizer with respect to the pruned poll
set. Notice that if all the points $y$ in the pruned poll set lie outside $X$, then $\psi_X(y) = \infty$ (and
thus $f_X(y) = \infty$) for all $y \in P_k^p$, and $f$ would not be evaluated at all in order to conclude
that $x_k$ is a mesh local optimizer. We have seen large savings from this fact in preliminary
tests because in the standard POLL step, the only poll points at which one does not have
to evaluate the function are those that lie outside $X$. The value of $f$ is required for all poll
points that lie in $X$, even if they lie along ascent directions.

We now present the GPS algorithm that uses any available derivative information.


GPS  Algorithm  with  derivative  information 

• Initialization:

Let $x_0$ be such that $f_X(x_0)$ is finite, fix $\Delta_0 > 0$ and set the iteration counter $k$ to 0.

• SEARCH and POLL steps on current mesh $M_k$ (defined in equation (2)):
Perform the SEARCH and possibly the POLL steps (or only part of the steps if a trial
mesh point with a lower objective function value is found).

- SEARCH: Evaluate $f_X$ on a finite subset of trial points on the mesh $M_k$ (the
strategy that selects the points is provided by the user).

- POLL: Evaluate $f_X$ on the pruned poll set $P_k^p$ (see equation (5)) around $x_k$ defined
by $\Delta_k$ and $D_k^p \subseteq D_k$ ($D_k$ depends on $k$ and $x_k$, and $D_k^p$ on the availability of
derivatives).

• Parameter update:

If the SEARCH or POLL step produces an improved mesh point, i.e., a feasible iterate
$x_{k+1} \in M_k \cap X$ for which $f(x_{k+1}) < f(x_k)$, then update $\Delta_{k+1} \ge \Delta_k$ through rule (4).
Otherwise, $x_k$ is a mesh local optimizer. Set $x_{k+1} = x_k$ and decrease $\Delta_{k+1} < \Delta_k$ through
rule (4).

Increase $k \leftarrow k + 1$ and go back to the SEARCH and POLL step.
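
The algorithm above can be sketched compactly for the unconstrained case. This is not the authors' implementation; it is a minimal illustration under stated assumptions: the SEARCH step is empty, $D_k$ is the fixed positive basis $[I, -I]$, $\tau = 2$ with $w_k = 0$ on success and $w_k = -1$ on failure, and the callback `grad` (a hypothetical name) returns `None` when no derivative information is available.

```python
import math

def gps_pruned_poll(f_X, grad, x0, delta0=1.0, tol=1e-6, max_iter=10_000):
    """GPS with an empty SEARCH step and a gradient-pruned POLL step.

    Assumptions of this sketch: D_k is the fixed positive basis [I, -I],
    tau = 2, w_k = 0 on success and w_k = -1 on failure, and grad(x) may
    return None when no derivative information is available at x.
    """
    n = len(x0)
    D = [tuple((1 if i == j else 0) * s for j in range(n))
         for s in (1, -1) for i in range(n)]
    x, fx, delta, evals = tuple(x0), f_X(x0), delta0, 1
    assert math.isfinite(fx), "f_X(x0) must be finite"
    for _ in range(max_iter):
        if delta < tol:
            break
        g = grad(x)
        D_p = D if g is None else [
            d for d in D if sum(gi * di for gi, di in zip(g, d)) <= 0]
        for d in D_p:                       # POLL on the pruned poll set
            y = tuple(xi + delta * di for xi, di in zip(x, d))
            fy = f_X(y)
            evals += 1
            if fy < fx:                     # improved mesh point: success
                x, fx = y, fy
                break
        else:                               # mesh local optimizer: refine
            delta /= 2
    return x, fx, evals

f = lambda x: (x[0] - 1) ** 2 + (x[1] + 0.5) ** 2
g = lambda x: (2 * (x[0] - 1), 2 * (x[1] + 0.5))
x_pruned, f_pruned, evals_pruned = gps_pruned_poll(f, g, (0.0, 0.0))
x_plain, f_plain, evals_plain = gps_pruned_poll(f, lambda x: None, (0.0, 0.0))
```

On this small quadratic both variants reach the same point, and the pruned variant never uses more function evaluations than the unpruned one. A fixed positive basis is used here only for brevity; Section 4 argues that richer direction sets allow more aggressive pruning.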


If the gradient is available and is nonzero, then its negative must prune at least one
column of $D_k$ since there must be at least one column for which $d^T \nabla f(x_k) > 0$ (see Theorem
2.3). Moreover, it could prune as many as $|D_k| - 1$ directions (i.e., all but one), leading
to considerable savings for an expensive function. In Section 4, we choose a particular $D$ for
which very rough derivative information allows us to prune the respective special choices we
make for $D_k$ to singletons. We think it is an interesting aside that the single poll directions
left after pruning are generally the negatives of the $\ell_1$ and $\ell_\infty$ gradients Ha.


Of  course,  we  expect  that  our  pruning  operation  will  yield  more  iterations  in  which  the 
POLL  step  fails  to  find  an  improved  mesh  point,  but  this  should  happen  less  frequently  as 
the  mesh  size  parameter  gets  smaller.  Furthermore,  the  role  of  the  POLL  step  is  mainly  to 
determine  whether  the  current  iterate  is  a  mesh  local  optimizer.  This  means  that  the  string 
of  successful  searches  on  the  current  mesh  is  halted  and  searching  can  start  afresh  on  a  finer 
mesh.  But  for  all  sufficiently  small  values  of  the  mesh  size  parameter,  if  our  pruned  POLL 
fails,  then  the  “unpruned”  POLL  would  also  fail.  This  suggests  that  derivatives  might  be 
more  useful  when  the  mesh  parameter  is  small  and  the  poll  step  would  not  be  expected  to 
get  out  of  the  current  basin.  Of  course,  the  search  can  always  move  us  into  a  deeper  basin 
housing  a  better  local  optimum. 

4 Using derivative information to prune well


Each  iteration  of  a  GPS  algorithm  is  dependent  upon  the  choice  of  a  positive  spanning  set 
Dk,  selected  from  the  columns  of  the  larger  finite  positive  spanning  set  D.  Here  it  is  useful  to 
think  of  D  as  preselected,  but  in  fact,  it  can  be  modified  finitely  many  times  at  the  discretion 
of  the  user. 


In this section, we first consider a simple measure of the richness of the set $D$. Theorem 2.3
says that a positive spanning set has at least one member in every half space. We are
interested in a positive spanning set $D$ so rich in directions that, for every half space, a
positive spanning subset of vectors in $D$ can be selected in such a way that a single element
of the subset lies in the half space in question. We will show that no positive basis $D$ can
be so rich, but we will prove that the set $D = \{-1, 0, 1\}^n$ has this property. We build up to
this point by considering some useful measures of directional richness in $D$. The upshot is a
major point of this paper: when gradients are not available, positive bases are useful in GPS
because they give small poll sets m. However, when derivative information is available,
positive bases may be a poor choice because a richer choice of directions makes it easier to
isolate a descent direction and prune to a smaller poll set than would result from pruning
a preselected positive basis.


The following definition is consistent with the earlier Definition 3.2, although it is more
general  in  keeping  with  the  applications  to  follow. 


Definition 4.1 (i) Given a positive spanning set $D \subseteq \mathbb{R}^n$ and a nonzero vector $v \in \mathbb{R}^n$,
let $D(v) \subseteq D$ be a positive spanning set that minimizes the cardinality of the pruned set
$D^p(v) = \{d \in D(v) : d^T v \ge 0\}$ (that cardinality is denoted $p(D, v)$). Then the vector $v$ is
said to prune $D$ to $p(D, v)$ vectors.

(ii) Let $p(D) = \max\{p(D, v) : v \in \mathbb{R}^n \setminus \{0\}\}$. Then the positive spanning set $D$ is said to
prune to $p(D)$ vectors.


If  D  is  not  only  a  positive  spanning  set,  but  also  a  positive  basis,  then  one  would  expect 
to  see  less  pruning  in  general  since  D(v)  =  D  for  every  choice  of  v.  Indeed,  it  is  not  surprising 
for  positive  bases  that  p(D)  grows  with  the  dimension  n. 


Proposition 4.2 If $D$ is a positive spanning set, then $1 \le p(D, v) \le p(D) \le |D| - 1$ for all
nonzero $v \in \mathbb{R}^n$. Moreover, if $D$ is a positive basis, then $\frac{n+1}{2} \le p(D) < |D| \le 2n$.


Proof. The first assertion is direct, because for any $v \ne 0$ and any choice of positive spanning
set, there must be at least one member of the set that makes a positive inner product with
$v$, and another that makes a negative one (see Theorem 2.3). Thus, every possible $D^p(v)$
contains at least one element and at most $|D| - 1$.


This argument also guarantees that $p(D) < |D|$. To derive a lower bound on $p(D)$, we
simply consider $v$ and $-v$. Since no proper subset of a positive basis is a positive spanning
set, the only possible choice for $D(v)$ and for $D(-v)$ is $D$. Thus, $D^p(v) \cup D^p(-v) = D$,
and so $2p(D) \ge p(D, v) + p(D, -v) \ge |D|$. The other inequalities follow from Theorem 2.3
where it is shown that $2n \ge |D| \ge n + 1$ for any positive basis. ■
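
For small cases, $p(D, v)$ can be computed by brute force, which makes the contrast between a positive basis and a richer direction set tangible. The sketch below is restricted to integer vectors in $\mathbb{R}^2$, omits the zero vector from $\{-1, 0, 1\}^2$, and counts the directions making a nonnegative inner product with $v$; these restrictions and all function names are assumptions of this illustration. The spanning test relies on Theorem 2.3: in the plane, a set fails to positively span exactly when some normal to one of its members makes a nonpositive inner product with every member.

```python
from itertools import combinations

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def positively_spans_R2(S):
    """Exact test for integer vectors in R^2, based on Theorem 2.3.

    S fails to positively span R^2 exactly when some normal v to a member
    of S satisfies v^T d <= 0 for every d in S.
    """
    if not S:
        return False
    for a, b in S:
        for v in ((-b, a), (b, -a)):
            if all(dot(v, d) <= 0 for d in S):
                return False
    return True

def p(D, v):
    """Brute-force p(D, v): smallest pruned set over spanning subsets of D."""
    best = None
    for r in range(1, len(D) + 1):
        for S in combinations(D, r):
            if positively_spans_R2(S):
                kept = sum(1 for d in S if dot(d, v) >= 0)
                best = kept if best is None else min(best, kept)
    return best

basis = [(1, 0), (0, 1), (-1, 0), (0, -1)]                  # [I, -I]
rich = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)]
results = {v: (p(basis, v), p(rich, v)) for v in [(1, 0), (1, 1), (1, 2)]}
```

For each sample $v$, the positive basis prunes to two or three directions, while the richer set prunes to a single direction.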


Taking $v$ as the negative gradient $-\nabla f(x_k)$, Definition 4.1 can be applied to the poll set
of a GPS algorithm, as introduced in Section 3.3.


Proposition 4.3 If $\nabla f(x_k)$ is available at iteration $k$ of the GPS algorithm, then the poll
set at $x_k$ has minimal cardinality $p(D, -\nabla f(x_k))$, which is an upper bound on the number of
function evaluations required to execute the POLL step using a minimal poll set.


Proof. From the discussion following Definition 3.2, the reader will see that $p(D, -\nabla f(x_k))$
represents the minimal number of polling directions that can be found by pruning choices of
$D_k = D(-\nabla f(x_k)) \subseteq D$. Consequently, the cardinality of the pruned poll set $P_k^p$ constructed
from $D_k^p = D^p(-\nabla f(x_k)) \subseteq D_k \subseteq D$ is minimal. ■


When $P_k^p$ is constructed from $D_k^p = D^p(-\nabla f(x_k))$, we will say that $P_k^p$ is a minimal poll
set. The above results imply that the richer the directional choice in a positive spanning set
$D$, the more sagacity can be employed to ensure that the poll set can be extensively pruned.
In the next section, we propose a set of directions rich enough so that, when the gradient
is available, the upper bound on the number of function evaluations in the POLL step (see
Proposition 4.3) is one; thus $P_k^p$ contains a single element.


4.1  Pruning  with  the  gradient 

We show here that the set $\mathbb{D} = \{-1, 0, 1\}^n$ prunes to a singleton, i.e., for any nonzero vector $v \in \mathbb{R}^n$, there exists a positive spanning set $\mathbb{D}(v) \subset \mathbb{D}$ such that the subset of vectors of $\mathbb{D}(v)$ that makes a nonnegative inner product with $v$ consists of a single element. Note that the set $\mathbb{D}$ is never actually constructed by the algorithm; it serves only as a mechanism to identify a pruned poll set consisting of a single direction. The key lies in the construction of $\mathbb{D}(v)$, which is done by taking the union of the ascent directions of $\mathbb{D}$ with an element $d$ of $\mathbb{D}$ that satisfies the following properties:

$$\text{if } v_i = 0, \text{ then } d_i = 0, \tag{6}$$

$$\text{if } v_i \neq 0, \text{ then } d_i \in \{0, \operatorname{sign}(v_i)\}, \tag{7}$$

$$\text{if } |v_i| = \|v\|_\infty, \text{ then } d_i = \operatorname{sign}(v_i), \tag{8}$$

for every $i \in N = \{1, 2, \ldots, n\}$, with the convention that $\operatorname{sign}(0) = 0$. Our construction will be such that $d$ will be the only unpruned member of $\mathbb{D}(v)$.

We first show the existence of vectors in $\mathbb{D}$ satisfying properties (6)-(8). This means that there are various choices for $\mathbb{D}(v)$. Three different choices for the single unpruned direction are given, and we will discuss their significance below.

Proposition 4.4 Let $\mathbb{D} = \{-1, 0, 1\}^n$. For any nonzero $v \in \mathbb{R}^n$, properties (6)-(8) are satisfied by each vector $d^{(1)}$, $d^{(2)}$, and $d^{(\infty)}$, where for each $i \in N$,

$$d_i^{(1)} = \begin{cases} \operatorname{sign}(v_i) & \text{if } |v_i| = \|v\|_\infty, \\ 0 & \text{otherwise,} \end{cases} \tag{9}$$

$$d^{(2)} \in \arg\max_{d \in \mathbb{D} \setminus \{0\}} \frac{v^T d}{\|d\|_2}, \tag{10}$$

$$d_i^{(\infty)} = \operatorname{sign}(v_i). \tag{11}$$
Proof. Let $v$ be a nonzero vector in $\mathbb{R}^n$. The proof is trivial for $d^{(1)}$ and for $d^{(\infty)}$. To show the result for $d^{(2)}$, first note that, since $d^{(2)} \in \mathbb{D} \setminus \{0\}$, $\|d^{(2)}\|_2^2 \in N$. For each $\delta \in N$, define the optimization problem

$$(P_\delta) \qquad \max_{d \in \mathbb{D}} \; v^T d \quad \text{s.t.} \quad \|d\|_2^2 = \delta.$$

Let $N^-$ and $N^+$ be a partition of the set of indices $N$ such that $|N^+| = \delta$ and $|v_j| \geq |v_i|$ whenever $i \in N^-$ and $j \in N^+$. Define $d^\delta \in \mathbb{D}$ so that $d_i^\delta = 0$ for each $i \in N^-$ and $d_j^\delta = \operatorname{sign}(v_j)$ for each $j \in N^+$. It follows that $d^\delta$ is an optimal solution for $(P_\delta)$.

From (10), it follows that $d^{(2)} = d^\delta$ for some $\delta$ not less than the number of elements of $v$ with magnitude $\|v\|_\infty$ and not greater than the number of nonzero elements of $v$. By construction, $d^\delta$ and thus $d^{(2)}$ satisfy properties (6)-(8). ■
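
As an illustration of Proposition 4.4, the following Python sketch computes the three directions of equations (9)-(11) for a given vector and checks properties (6)-(8). The function names (`d_one`, `d_two`, `d_inf`, `satisfies_6_to_8`) are hypothetical, and `d_two` follows the argument of the proof above (keep the $\delta$ largest-magnitude components and choose the best $\delta$) rather than any particular implementation.

```python
import math


def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def d_one(v):
    """Equation (9): sign(v_i) on components of largest magnitude, 0 elsewhere."""
    vmax = max(abs(t) for t in v)
    return [sign(t) if abs(t) == vmax else 0 for t in v]


def d_inf(v):
    """Equation (11): componentwise sign of v."""
    return [sign(t) for t in v]


def d_two(v):
    """Equation (10): a maximizer of v^T d / ||d||_2 over nonzero d in {-1,0,1}^n.

    For each delta, the best d with ||d||_2^2 = delta keeps the delta
    largest-magnitude components of v (as in the proof), so only n
    candidates need to be compared.
    """
    order = sorted(range(len(v)), key=lambda i: -abs(v[i]))
    best_ratio, best_delta, partial = -math.inf, 0, 0.0
    for delta, i in enumerate(order, start=1):
        if v[i] == 0:          # remaining components are zero: stop
            break
        partial += abs(v[i])
        ratio = partial / math.sqrt(delta)
        if ratio > best_ratio:
            best_ratio, best_delta = ratio, delta
    d = [0] * len(v)
    for i in order[:best_delta]:
        d[i] = sign(v[i])
    return d


def satisfies_6_to_8(v, d):
    """Check properties (6), (7), and (8) componentwise."""
    vmax = max(abs(t) for t in v)
    for vi, di in zip(v, d):
        if vi == 0 and di != 0:
            return False
        if vi != 0 and di not in (0, sign(vi)):
            return False
        if abs(vi) == vmax and di != sign(vi):
            return False
    return True


if __name__ == "__main__":
    v = [3, -3, 1]
    for name, d in (("d1", d_one(v)), ("d2", d_two(v)), ("dinf", d_inf(v))):
        print(name, d, satisfies_6_to_8(v, d))
```

For $v = (3, -3, 1)$ the three directions differ only in whether the small third component is kept, which is the latitude that properties (6)-(8) allow.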

The next theorem shows that for any nonzero vector $v \in \mathbb{R}^n$, any direction $d$ satisfying properties (6)-(8) can be completed into a positive spanning set by adding the directions in

$$\mathbb{A}(v) = \{d \in \mathbb{D} : v^T d < 0\}, \tag{12}$$

and consequently, $v$ prunes $\mathbb{D}$ to the single vector $\{d\}$.

Theorem 4.5 The set $\mathbb{D} = \{-1, 0, 1\}^n$ has $p(\mathbb{D}) = 1$.

Proof. Let $v \in \mathbb{R}^n$ be nonzero, let $d \in \mathbb{D}$ be any vector satisfying properties (6)-(8), and define $\mathbb{D}(v)$ to be the union of $\{d\}$ with the set $\mathbb{A}(v)$ of directions in $\mathbb{D}$ that make negative inner products with $v$.

Theorem 2.3 says that $\mathbb{D}(v) = \{d\} \cup \mathbb{A}(v) \subset \mathbb{D}$ positively spans $\mathbb{R}^n$ if and only if, for any nonzero $b \in \mathbb{R}^n$, there exists a vector $d'$ in $\mathbb{D}(v)$ such that $b^T d' > 0$. Let $b \in \mathbb{R}^n$ be a nonzero vector. If $b^T d \neq 0$, then clearly either of $\pm d$ makes a positive inner product with $b$, in which case the proof is complete since both are in $\mathbb{D}(v)$ (note that $-d \in \mathbb{A}(v)$ because $v^T d > 0$). Thus, consider the situation where $b^T d = 0$. The analysis is divided in two disjoint cases.

Case 1: $b_i v_i < 0$ for some index $i \in N$. Set $d'_j = 0$ for $j \in N \setminus \{i\}$ and $d'_i = \operatorname{sign}(b_i)$. It follows that $d'$ belongs to $\mathbb{A}(v) \subset \mathbb{D}(v)$ and makes a positive inner product with $b$ since

$$v^T d' = v_i d'_i = v_i \operatorname{sign}(b_i) = -v_i \operatorname{sign}(v_i) = -|v_i| < 0, \quad \text{and}$$

$$b^T d' = b_i d'_i = b_i \operatorname{sign}(b_i) = |b_i| > 0.$$

Case 2: $b_i v_i \geq 0$ for all $i \in N$. From equations (6) and (7), we have $b_i d_i \geq 0$ for all $i \in N$, and since $b^T d = 0$, it follows that $b_i d_i = 0$ for all $i \in N$. Let $i, j \in N$ be such that $|v_i| \geq |v_\ell|$ for all $\ell \in N$ and $b_j \neq 0$. Then $b_i = 0$, $d_j = 0$, and by equation (8), $d_i = \operatorname{sign}(v_i) \neq 0$. Furthermore, since $d_j = 0$, $|v_j| < |v_i|$. Set $d'_\ell = 0$ for $\ell \in N \setminus \{i, j\}$, $d'_i = -\operatorname{sign}(v_i)$, and $d'_j = \operatorname{sign}(b_j)$. It follows that $d'$ belongs to $\mathbb{A}(v) \subset \mathbb{D}(v)$ and makes a positive inner product with $b$ since

$$v^T d' = v_i d'_i + v_j d'_j = -v_i \operatorname{sign}(v_i) + v_j \operatorname{sign}(b_j) \leq -|v_i| + |v_j| < 0, \quad \text{and}$$

$$b^T d' = b_i d'_i + b_j d'_j = b_j \operatorname{sign}(b_j) = |b_j| > 0.$$

Thus $\mathbb{D}^p(v) = \{d\}$, and the proof is complete since $v \neq 0$ was arbitrary. ■
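
The case analysis in this proof is constructive, and it can be traced in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `witness` is hypothetical, indices are 0-based, and the inputs are assumed to satisfy the hypotheses of the proof ($v \neq 0$, $b \neq 0$, and $d$ satisfying properties (6)-(8) for $v$). Given such inputs it returns a member of $\{d\} \cup \mathbb{A}(v)$ that makes a positive inner product with $b$.

```python
def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def witness(v, d, b):
    """Return w in {d} U A(v) with b^T w > 0, following the proof's cases.

    Assumes v != 0, b != 0, and that d satisfies properties (6)-(8) for v.
    """
    n = len(v)
    bd = dot(b, d)
    if bd > 0:
        return list(d)
    if bd < 0:
        return [-t for t in d]      # -d lies in A(v) because v^T d > 0
    w = [0] * n
    # Case 1: some component of b has sign opposite to that of v.
    for i in range(n):
        if b[i] * v[i] < 0:
            w[i] = sign(b[i])
            return w
    # Case 2: b_i v_i >= 0 for all i. Pick i of largest |v_i| and j with b_j != 0.
    i = max(range(n), key=lambda t: abs(v[t]))
    j = next(t for t in range(n) if b[t] != 0)
    w[i] = -sign(v[i])
    w[j] = sign(b[j])
    return w


if __name__ == "__main__":
    v = [2, 1, 0]
    d = [1, 0, 0]                   # d^(1) for this v
    for b in ([0, 1, 0], [1, -1, 0], [0, 0, -1], [-3, 0, 0]):
        w = witness(v, d, b)
        print(b, "->", w, "v.w =", dot(v, w), "b.w =", dot(b, w))
```

Every returned vector other than $d$ itself has a negative inner product with $v$, so with $v = -\nabla f(x_k)$ it would be pruned, which is why only $d$ remains.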

We now apply these results to the determination of the poll set in the GPS algorithm. Of course, for $v = -\nabla f(x_k)$, the set $\mathbb{A}(-\nabla f(x_k))$ would contain only ascent directions, which $-\nabla f(x_k)$ then prunes away. Therefore, the directions in $\mathbb{A}(-\nabla f(x_k))$ do not need to be explicitly constructed.

Theorem 4.6 If $\nabla f(x_k)$ is available at iteration $k$ of the GPS algorithm with positive spanning set $D = \mathbb{D}$ and if $D_k = \mathbb{D}(-\nabla f(x_k)) = \{d\} \cup \mathbb{A}(-\nabla f(x_k))$ with $d$ satisfying properties (6)-(8), then a single function evaluation is required for the poll step.

Proof. Theorem 4.5 and Proposition 4.3 with $v = -\nabla f(x_k)$ guarantee the result. ■

Thus, choosing the rich set of directions $\mathbb{D}$ and using the gradient information allows us to evaluate the barrier objective function $f_X$ at a single poll point $x_k + \Delta_k d$. This may not require any evaluations of the objective function $f$ if the trial point lies outside of $X$, or obviously if $\nabla f(x_k) = 0$.
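
A minimal sketch of such a single-direction poll step is given below. It is not the NOMADm implementation: the names `barrier` and `pruned_poll_step` are hypothetical, the direction is taken to be $d^{(\infty)}$ for $v = -\nabla f(x_k)$ (one of the admissible choices), and feasibility is supplied by the caller as a predicate.

```python
import math


def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def barrier(f, in_X):
    """Build f_X: equal to f on X and +infinity elsewhere.

    f is not called at infeasible points.
    """
    def f_X(x):
        return f(x) if in_X(x) else math.inf
    return f_X


def pruned_poll_step(f_X, x_k, fx_k, grad_k, delta_k):
    """Poll in the single direction d = sign(-grad_k), i.e. d^(inf) for v = -grad.

    Returns (new_x, new_f, success). At most one evaluation of f_X is made.
    """
    d = [-sign(g) for g in grad_k]
    if not any(d):
        return x_k, fx_k, False          # zero gradient: nothing to poll
    trial = [xi + delta_k * di for xi, di in zip(x_k, d)]
    f_trial = f_X(trial)
    if f_trial < fx_k:
        return trial, f_trial, True      # improved mesh point
    return x_k, fx_k, False              # x_k is declared a mesh local optimizer


if __name__ == "__main__":
    f = lambda x: x[0] ** 2 + x[1] ** 2
    in_box = lambda x: all(-5.0 <= t <= 5.0 for t in x)
    f_X = barrier(f, in_box)
    print(pruned_poll_step(f_X, [1.0, -2.0], 5.0, [2.0, -4.0], 0.5))
```

Wrapping `f` in the barrier keeps the two "free" cases mentioned above explicit: an infeasible trial point returns infinity without calling `f`, and a zero gradient returns before any trial point is formed.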


Again as an aside, we mention that our notation is meant to be consistent with [12] in the sense that for $v = -\nabla f(x_k)$, the vectors $d^{(1)}$ and $d^{(\infty)}$ are the negatives of the normalized $\ell_1$ and $\ell_\infty$ gradients of $f$ at $x_k$, respectively, while $d^{(2)}$ is the vector in $\mathbb{D}$ that makes the smallest angle with $-\nabla f(x_k)$.

4.2 Pruning with an approximation of the gradient

In  many  engineering  applications,  derivatives  are  approximated  or  inaccurately  computed 
without  much  additional  cost  during  the  computation  of  the  objective  [251  EH-  Let  us  define 
a  measure  of  the  quality  of  an  approximation  of  a  vector. 


Definition 4.7 Let $g$ be a nonzero vector in $\mathbb{R}^n$ and $\epsilon \geq 0$. Define $J_\epsilon(g) = \{i \in N : |g_i| + \epsilon \geq \|g\|_\infty\}$, and for every $i \in N$ set

$$d_i^\epsilon(g) = \begin{cases} \operatorname{sign}(g_i) & \text{if } i \in J_\epsilon(g), \\ 0 & \text{otherwise.} \end{cases}$$

The vector $g$ is said to be an $\epsilon$-approximation to the large components of a nonzero vector $v \in \mathbb{R}^n$ if $i \in J_\epsilon(g)$ whenever $|v_i| = \|v\|_\infty$ and if $\operatorname{sign}(g_i) = \operatorname{sign}(v_i)$ for every $i \in J_\epsilon(g)$.

Note that if $\epsilon = 0$, then $d^\epsilon(g)$ is identical to $d^{(1)}$ in equation (9), and if $\epsilon = \|g\|_\infty$, then $d^\epsilon(g)$ is identical to $d^{(\infty)}$ in equation (11). The following result implicitly provides a sufficiency condition on the quality of the approximation.

Proposition 4.8 Let $\epsilon \geq 0$ and $g$ be an $\epsilon$-approximation to the large components of a nonzero vector $v \in \mathbb{R}^n$. Then $d^\epsilon(g)$ satisfies properties (6)-(8).

Proof. Let $\epsilon \geq 0$ and $g \neq 0$ be an $\epsilon$-approximation to the large components of $v \neq 0 \in \mathbb{R}^n$. The set $J_\epsilon(g)$ is nonempty and contains every $i$ for which $|v_i| = \|v\|_\infty$. If $i \in J_\epsilon(g)$, then $d_i^\epsilon(g) = \operatorname{sign}(g_i) = \operatorname{sign}(v_i)$, and therefore properties (6)-(8) are satisfied. If $i \notin J_\epsilon(g)$, then the properties are trivially satisfied. ■
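
Definition 4.7 translates directly into code. The following Python sketch uses hypothetical names (`J_eps`, `d_eps`, `is_eps_approximation`) and 0-based indices; it is meant only to make the roles of $J_\epsilon(g)$ and $\epsilon$ concrete, not to reproduce any existing implementation.

```python
def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def J_eps(g, eps):
    """Indices i with |g_i| + eps >= ||g||_inf (0-based)."""
    gmax = max(abs(t) for t in g)
    return {i for i, t in enumerate(g) if abs(t) + eps >= gmax}


def d_eps(g, eps):
    """The direction d^eps(g): sign(g_i) on J_eps(g), 0 elsewhere."""
    J = J_eps(g, eps)
    return [sign(t) if i in J else 0 for i, t in enumerate(g)]


def is_eps_approximation(g, v, eps):
    """True if g is an eps-approximation to the large components of v."""
    J = J_eps(g, eps)
    vmax = max(abs(t) for t in v)
    covers_largest = all(i in J for i, t in enumerate(v) if abs(t) == vmax)
    signs_agree = all(sign(g[i]) == sign(v[i]) for i in J)
    return covers_largest and signs_agree


if __name__ == "__main__":
    v = [3.0, -0.5, -0.1]     # "true" vector
    g = [2.9, -1.2, 0.4]      # rough approximation; last sign is wrong
    for eps in (0.0, 2.0, 2.9):
        print(eps, d_eps(g, eps), is_eps_approximation(g, v, eps))
```

In this example the approximation has the wrong sign in its smallest component. It is still an $\epsilon$-approximation for $\epsilon = 0$ and $\epsilon = 2$, because that component stays outside $J_\epsilon(g)$, but not for $\epsilon = \|g\|_\infty = 2.9$, where every component must match in sign.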

The next result establishes a mild accuracy requirement for an approximation of the negative gradient to prune $\mathbb{D}(-\nabla f(x_k))$ to a singleton.

Lemma 4.9 Let $\epsilon \geq 0$ and $g_k$ be an $\epsilon$-approximation to the large components of $\nabla f(x_k) \neq 0$. If there exists a vector $d$ satisfying equations (6)-(8) for both $v = -\nabla f(x_k)$ and $v = -g_k$, then the set $\{d\} \cup \mathbb{A}(-\nabla f(x_k))$ positively spans $\mathbb{R}^n$ and prunes to $\{d\}$.

Proof. This follows directly from the proof of Theorem 4.5. ■

The significance of this result resides in the wide latitude allowed for the approximation. For example, if $d^{(\infty)}$ with $v = -g_k$ is used (see equation (11)) to obtain $d$, then we only need the components of the approximation $g_k$ to match signs with those of the true gradient $\nabla f(x_k)$ in order to prune to a singleton. If $d^{(1)}$ with $v = -g_k$ is used (see equation (9)) to obtain $d$, then we only need the component of largest magnitude of $g_k$ to have the same sign as that of $\nabla f(x_k)$.


The following result shows that if an approximation to the gradient matches signs with the true gradient for all components of sufficiently large magnitude, then the approximation prunes the set of poll directions to a singleton.

Theorem 4.10 Let $\epsilon \geq 0$. At iteration $k$ of the GPS algorithm with $D = \mathbb{D} = \{-1, 0, 1\}^n$, let $g_k \neq 0$ be an $\epsilon$-approximation of $\nabla f(x_k) \neq 0$. Then the set of directions $D_k = \{d^\epsilon(g)\} \cup \mathbb{A}(-\nabla f(x_k))$ positively spans $\mathbb{R}^n$. Thus, $D_k^p = \{d^\epsilon(g)\}$.

Proof. This follows directly by combining the fact that $d^\epsilon(g)$ is the same if constructed from $v = \nabla f(x_k)$ or $v = g_k$ and by Lemma 4.9. ■

Observe that when $\epsilon = 0$ or $\epsilon = \|g\|_\infty$, then Theorem 4.10 reduces to Theorem 4.6 with $d^\epsilon(g) = d^{(1)}$ or $d^\epsilon(g) = d^{(\infty)}$, respectively.


4.3  Pruning  with  incomplete  derivative  information 

It is possible that in some instances and some iterations, the entire gradient is not available, but a few partial or directional derivatives might be known. This situation may occur for example when $f(x, y) = g(x, y) \times h(y)$ where $g(x, y)$ is analytically given, but the structure of $h(y)$ is unknown. In such a case the partial derivatives of $f$ can be computed with respect to $x$ but not with respect to $y$.

If at iteration $k$, the directional derivative $f'(x_k; d)$ exists, is available, and is nonnegative, then polling in the direction $d$ from $x_k$ will fail if the mesh size parameter is small enough. This leads to the following result.

Proposition 4.11 Let $D_k \subset D$ be a positive spanning set and $V_k$ be the subset of directions $d$ in $D_k$ for which the directional derivative $f'(x_k; d) \geq 0$ is known. Then the pruned set of directions is $D_k^p = D_k \setminus V_k$ and contains $|D_k| - |V_k|$ directions.

Proof. The result follows from the fact that the directional derivative $f'(x_k; d)$ equals $d^T \nabla f(x_k)$. ■

Note that when the gradient exists, if $f'(x_k; d)$ is nonpositive, then $f'(x_k; -d)$ is nonnegative. The most typical application of incomplete gradient information is when some, but not all, partial derivatives are known. The spanning set $\mathbb{D}$ can be used to reduce as much as possible the size of the pruned poll set. The approach for doing so can be outlined as follows:

Pruning Operation for Incomplete Gradients

For $x_k \in \mathbb{R}^n$ and a set of nonzero partial derivatives $\frac{\partial f}{\partial x_j}(x_k)$, $j \in J \subset N$,

• Set $V_k = \{s_j e_j : j \in J\} \subset \mathbb{D}$, where $s_j = \operatorname{sign}\left(\frac{\partial f}{\partial x_j}(x_k)\right)$, $j \in J$, and $e_i$ denotes the $i$th coordinate vector, $i \in N$.

• Set $W_k = \{e_\ell : \ell \in N \setminus J\}$.

• Set $u = -\sum_{j \in J} s_j e_j - \sum_{\ell \in N \setminus J} e_\ell$.

• Set $D_k = V_k \cup W_k \cup \{u\}$, and prune $D_k$ to $D_k^p = W_k \cup \{u\}$.


For this construction, we can prove the following result:

Corollary 4.12 Let the partial derivatives at iteration $k$, $\frac{\partial f}{\partial x_j}(x_k)$ for $j \in J$, be available and all nonzero. If $\nabla f(x_k)$ exists, then the pruning operation yields a pruned set $D_k^p$ with $n + 1 - |J|$ directions.

Proof. By the construction above, it is clear that $B = V_k \cup W_k$ forms a basis for $\mathbb{R}^n$, and $u = -\sum_{i=1}^n b_i$, where $b_i \in B$, $i = 1, 2, \ldots, n$. Thus $D_k = B \cup \{u\}$ forms a positive basis for $\mathbb{R}^n$ with $|D_k| = n + 1$. Furthermore, since $\nabla f(x_k)$ exists, for any $v_j \in V_k$, we have

$$f'(x_k; v_j) = \nabla f(x_k)^T v_j = \nabla f(x_k)^T s_j e_j = \left|\frac{\partial f}{\partial x_j}(x_k)\right| > 0.$$

Since $V_k$ has $|J|$ directions, the result follows from Proposition 4.11. ■

The following example illustrates this result.

Example 4.13 Consider $f : \mathbb{R}^3 \to \mathbb{R}$ with $\frac{\partial f}{\partial x_1}(x_k) > 0$ and $\frac{\partial f}{\partial x_2}(x_k) < 0$, where $f$ is continuously differentiable at $x_k$. Then, with $\mathbb{D} = \{-1, 0, 1\}^3$, by defining $D_k$ as

$$D_k = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix},$$

the pruned set $D_k^p$ can be constructed using only the two last columns of $D_k$.
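
The pruning operation for incomplete gradients can be sketched as follows. The function name `incomplete_gradient_directions` and the dictionary-based interface are assumptions made for this illustration, and indices are 0-based, so the partial derivatives with respect to $x_1$ and $x_2$ of Example 4.13 are keyed by 0 and 1.

```python
def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def incomplete_gradient_directions(n, partials):
    """Build D_k = V_k U W_k U {u} and its pruned subset W_k U {u}.

    partials maps a 0-based index j to the known, nonzero value of the
    partial derivative of f with respect to x_j at x_k.
    Returns (D_k, D_k_pruned) as lists of direction vectors.
    """
    def scaled_unit(j, s):
        e = [0] * n
        e[j] = s
        return e

    # V_k: ascent directions s_j e_j for the known partial derivatives.
    V = [scaled_unit(j, sign(partials[j])) for j in sorted(partials)]
    # W_k: coordinate directions for the unknown partial derivatives.
    W = [scaled_unit(l, 1) for l in range(n) if l not in partials]
    # u: negative sum of all vectors in V_k and W_k.
    u = [-sum(component) for component in zip(*(V + W))]
    return V + W + [u], W + [u]


if __name__ == "__main__":
    # Example 4.13: n = 3, df/dx_1 > 0 and df/dx_2 < 0 are known.
    D_k, D_k_pruned = incomplete_gradient_directions(3, {0: 2.5, 1: -0.3})
    print("D_k        =", D_k)
    print("pruned D_k =", D_k_pruned)
```

Only the signs of the supplied partial derivatives are used, so the magnitudes 2.5 and -0.3 above are arbitrary placeholders. The pruned set has $n + 1 - |J| = 2$ directions, matching the last two columns of the matrix in Example 4.13.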


The results derived in this section are compatible with the previous ones. Indeed, if the full gradient is available, and if all its components are nonzero, then using $\mathbb{D}$ the pruned set requires $n + 1 - n = 1$ direction, i.e., $p(\mathbb{D}) = 1$ as in Theorem 4.5.

4.4 Pruning with linear constraints

A similar strategy can be used in the bound or linearly constrained case. Recall that $X = \{x \in \mathbb{R}^n : Ax \leq b\}$ defines the feasible region, where $A \in \mathbb{Q}^{m \times n}$ and $b \in \mathbb{R}^m$. Set $M = \{1, 2, \ldots, m\}$ and let $a_j$ be the $j$-th row of $A$ for $j \in M$.

For $\epsilon > 0$ and $x \in X$, define $A_\epsilon(x) = \{j \in M : a_j x \geq b_j - \epsilon\}$, the set of $\epsilon$-active
constraints  (as  used  in  my  The  set  of  positive  spanning  directions  D  is  implicitly  defined 
to contain all tangent cone generators of all points on all faces of the polytope $X$. Obviously, this set is never explicitly constructed. However, for any $x \in X$ and for a given $\epsilon > 0$, we will need to construct, as suggested in [23], the set $T(x) \subset D$ of tangent cone generators to all the $\epsilon$-active constraints. With this construction, if $D_k \subset D$ is a positive spanning set containing $T(x_k)$, then $D_k$ will conform to the boundary of $X$ for $\epsilon > 0$.
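
Identifying the $\epsilon$-active constraints is the only part of this construction that is simple enough to show in a few lines. The sketch below uses a hypothetical function name `eps_active`, plain Python lists for $A$, $b$, and $x$, and 0-based row indices; computing the tangent cone generators $T(x)$ themselves is not attempted here.

```python
def eps_active(A, b, x, eps):
    """Return the 0-based indices j with a_j x >= b_j - eps.

    A is a list of rows, b a list of right-hand sides, x a feasible point
    of {x : A x <= b}, and eps > 0 the activity tolerance.
    """
    active = []
    for j, (row, b_j) in enumerate(zip(A, b)):
        a_j_x = sum(a * x_i for a, x_i in zip(row, x))
        if a_j_x >= b_j - eps:
            active.append(j)
    return active


if __name__ == "__main__":
    # The box -5 <= x_1, x_2 <= 10 written as A x <= b.
    A = [[1, 0], [0, 1], [-1, 0], [0, -1]]
    b = [10, 10, 5, 5]
    for x in ([9.95, 0.0], [-5.0, -5.0], [0.0, 0.0]):
        print(x, eps_active(A, b, x, 0.1))
```

For bound constraints such as these, the tangent cone generators at a point with active set `[0]` would include the coordinate directions that do not leave the box through the first upper bound.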

If the gradient is known, then we can find a bound for the minimal cardinality of the pruned set through the following approach and subsequent proposition.

Pruning Operation under Linear Constraints

For $x_k \in \mathbb{R}^n$ with known gradient $\nabla f(x_k)$,

• Let $T(x_k)$ be the set of tangent cone generators to all $\epsilon$-active constraints.

• Set $D' = T(x_k) \cup \{d \in \mathbb{D} : \nabla f(x_k)^T d > 0\}$.

• If $D'$ positively spans $\mathbb{R}^n$, set $D_k = D'$; otherwise, set $D_k = D' \cup \{d\}$, where $d$ satisfies properties (6)-(8).

• Prune $D_k$ to $D_k^p$.

Proposition 4.14 Let $\epsilon > 0$. If $\nabla f(x_k)$ is known at iteration $k$, then the pruning operation yields a pruned set $D_k^p$ containing at most $|T(x_k)| + 1$ directions.

Proof. By construction, in the worst case, $D_k = T(x_k) \cup \{d\} \cup \{d \in \mathbb{D} : \nabla f(x_k)^T d > 0\}$, and $D_k^p = T(x_k) \cup \{d\}$, whose cardinality is $|T(x_k)| + 1$. ■

Again, this result is in agreement with previous ones: in the unconstrained case, or if there are no $\epsilon$-active constraints, then $A_\epsilon(x_k)$ is empty, reducing $D_k^p$ to a singleton. Also, when $\{d\}$ is used, then $d$ points outside $X$ for some boundary point within $\epsilon$ of $x_k$, and if the trial point $x_k + \Delta_k d$ does not belong to $X$, then $f$ will not be evaluated at that trial point.

If only incomplete derivative information is available, the gradient cannot be used as strongly to prune the set of directions. Define $D' = T(x_k) \cup \{d \in V_k : f'(x_k; d) \geq 0\}$, then construct $D_k$ by completing (if necessary) $D'$ into a positive spanning set. As before, $D_k^p$ is obtained by pruning ascent directions from $D_k$. By construction, all directions in $\{d \in V_k : f'(x_k; d) \geq 0\}$ will get pruned, and some directions in $T(x_k)$ might get pruned.


May  23,  2003 




5  Convergence  Results 

The convergence results for the above GPS algorithm with gradients require only the following assumptions.

A1: A function $f_X = f + \psi_X : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is available.

A2: The constraint matrix $A$ is rational.

A3: All iterates $\{x_k\}$ produced by the algorithm lie in a compact set.

Assumption A1 illustrates the lack of assumptions on $f$, which can be nondifferentiable or even discontinuous or unbounded for some $x$ in $X$. The strength of our convergence result does depend on the local smoothness at the limit point. The algorithms presented above do not require the gradient to be known (or even to exist) at each iterate $x_k$. The algorithms use gradient information only when it is available.

Assumption A2 is necessary for the proof that the mesh size parameter gets infinitely fine since it is related to the conformity of the selection rule for $D_k$ (see Definition 3.1).


Assumption  A3  is  a  standard  one  for  similar  algorithms  (see  mam  m\)- 
It ensures that the algorithm produces a limit point. It is implied by more restrictive assumptions such as assuming that the level set $\{x \in X : f(x) \leq f(x_0)\}$ is compact.

For completeness, we state the following theorem from [8]. The easy proof and a discussion can be found there.

Theorem 5.1 Under assumptions A1 and A3, there exists at least one limit point of the iteration sequence $\{x_k\}$. If $f$ is lower semicontinuous at such a limit point $\hat{x}$, then $\lim_k f(x_k)$ exists and is greater than or equal to $f(\hat{x})$. If $f$ is continuous at every limit point of $\{x_k\}$, then every limit point has the same function value.

The  following  result  was  first  shown  by  Torczon  [29]  for  unconstrained  optimization.  The 
proof  was  adapted  in  [8]  for  our  notation,  and  so  it  is  not  reproduced  here  since  it  is  not 
affected  by  the  contributions  of  the  present  paper. 


Proposition 5.2 The mesh size parameters satisfy $\liminf_{k \to +\infty} \Delta_k = 0$.

In [Ej, an example shows that this result cannot be strengthened to $\lim_k \Delta_k = 0$ without additional restrictions on the algorithm. Proposition 5.2 guarantees that there are infinitely many iterations for which the incumbent is a mesh local optimizer since the mesh size parameter shrinks only at such iterations. These iterations are the interesting ones as we will now show.

Definition 5.3 A subsequence of the GPS iterates $\{x_k\}_{k \in K}$ (for some subset of indices $K$) consisting of mesh local optimizers is said to be a refining subsequence if $\{\Delta_k\}_{k \in K}$ converges to zero.




The existence of convergent refining subsequences follows trivially from Proposition 5.2 and from Assumption A3. We now show results about their limit $\hat{x}$. Keep in mind that the objective function $f$ is not evaluated at infeasible trial points. The proof is similar to that found in [8], but it is modified to take into account the iterations where the gradient is available.

Theorem 5.4 Under assumptions A1-A3, if $\hat{x}$ is any limit of a refining subsequence, and $d$ is any direction in $D$ for which $f$ at a poll step was evaluated or pruned by the gradient for infinitely many iterates in the subsequence, and if $f$ is Lipschitz near $\hat{x}$, then the generalized directional derivative of $f$ at $\hat{x}$ in the direction $d$ is nonnegative, i.e., $f^\circ(\hat{x}; d) \geq 0$.

Proof. Let $\{x_k\}_{k \in K}$ be a refining subsequence with limit point $\hat{x}$. Let $d \in D$ be obtained as in the statement of the theorem (finiteness of $D$ ensures the existence of $d$). The analysis is divided in two cases.

First, consider the case where the gradient is evaluated only a finite number of times in the subsequence $\{x_k\}_{k \in K}$. Then this finite number of iterates may be ignored, and therefore, for $k$ sufficiently large, all the poll steps in the direction $d$, $x_k + \Delta_k d$, are feasible. If they had not been, then $f_X$ would have been infinite there and so $f$ would not have been evaluated (recall that if $x \notin X$, then $f_X(x)$ is set at $+\infty$ and $f(x)$ is not evaluated). Note that since $f$ is Lipschitz near $\hat{x}$, it must be finite near $\hat{x}$. Thus, we have that infinitely many of the right hand quotients of Clarke's generalized directional derivative definition [13]

$$f^\circ(\hat{x}; d) = \limsup_{y \to \hat{x},\, t \downarrow 0} \frac{f(y + td) - f(y)}{t} \geq \limsup_{k \in K} \frac{f(x_k + \Delta_k d) - f(x_k)}{\Delta_k} \tag{13}$$

are defined, and in fact they are the same as for $f_X$. This allows us to conclude that all of them must be nonnegative or else the corresponding poll step would have been successful in identifying an improved mesh point (recall that refining subsequences are constructed from mesh local optimizers).

Second, consider the case where the gradient is used in an infinite number of iterates in the subsequence. Then there is a subsequence that converges to $\hat{x}$ for which $d^T \nabla f(x_k) \geq 0$, and thus the right hand side of (13) is bounded below by zero. ■


The following corollary to this key result strengthens Torczon's unconstrained result. In this corollary, we will assume still that $f$ is Lipschitz near $\hat{x}$, and in addition, we will assume that the generalized gradient of $f$ at $\hat{x}$ is a singleton. This is equivalent to assuming that $f$ is strictly differentiable at $\hat{x}$, i.e., that $\nabla f(\hat{x})$ exists and

$$\lim_{y \to \hat{x},\, t \downarrow 0} \frac{f(y + tw) - f(y)}{t} = w^T \nabla f(\hat{x})$$

for all $w \in \mathbb{R}^n$ (see [13], Proposition 2.2.1 or Proposition 2.2.4). The proof is omitted since it is identical to that in [8].

Theorem 5.5 Under assumptions A1 and A3, let $X = \mathbb{R}^n$ and $\hat{x}$ be any limit of a refining subsequence. If $f$ is strictly differentiable at $\hat{x}$, then $\nabla f(\hat{x}) = 0$.




For linearly constrained optimization we get the following result, whose proof can be found in [8].

Theorem 5.6 Under assumptions A1-A3, if $f$ is strictly differentiable at a limit point $\hat{x}$ of a refining subsequence, and if the rule for selecting the positive spanning sets $D_k = D(k, x_k) \subseteq D$ conforms to $X$ for an $\epsilon > 0$, then $\nabla f(\hat{x})^T w \geq 0$ for all $w \in T_X(\hat{x})$, and $-\nabla f(\hat{x}) \in N_X(\hat{x})$. Thus, $\hat{x}$ is a KKT point.


6  Numerical  experiments 

The  algorithm  described  in  Section  [3]  was  programmed  in  Matlab  [2j  and  applied  to  20 
problems  from  the  CUTE  [HJ  collection.  Most  of  these  problems  have  multiple  first-order 
stationary  points,  which  is  important  if  we  are  to  compare  the  solutions  found  by  various 
strategies used by the algorithm. For each problem, performance of the algorithm is compared using seven different poll strategies. No SEARCH method is employed, which means that the algorithm does not use its full potential to identify deeper basins (this is why the values presented are sometimes higher than the best known values). All runs terminated either because $\Delta_k < 10^{-4}$ or because 50,000 function evaluations had been performed. (The latter was the reasonable limit for the Matlab NOMADm code.) The initial mesh size parameter was set to $\Delta_0 = 1$, and was multiplied by 2 when an improved mesh point was identified, and divided by 2 at mesh local optimizers.
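
To make the experimental setup concrete, here is a minimal poll-only GPS loop in Python using the mesh update and stopping rules just described. It is a sketch under stated assumptions, not the NOMADm code: the function name `gps_poll_only` is hypothetical, only bound constraints are handled (through a barrier), the standard $2n$ directions $[I, -I]$ are polled in a fixed order, and the poll stops at the first improving point.

```python
import math


def gps_poll_only(f, lower, upper, x0, delta0=1.0, delta_tol=1e-4,
                  max_evals=50000):
    """Poll-only GPS with the standard 2n directions and bound constraints.

    Returns (x, f(x), number of evaluations of f).
    """
    n = len(x0)
    identity = [[1 if j == i else 0 for j in range(n)] for i in range(n)]
    directions = identity + [[-t for t in e] for e in identity]
    evals = 0

    def f_X(x):
        # Barrier function: f is not evaluated at infeasible points.
        nonlocal evals
        if any(xi < lo or xi > hi for xi, lo, hi in zip(x, lower, upper)):
            return math.inf
        evals += 1
        return f(x)

    x, delta = list(x0), delta0
    fx = f_X(x)
    while delta >= delta_tol and evals < max_evals:
        improved = False
        for d in directions:
            trial = [xi + delta * di for xi, di in zip(x, d)]
            f_trial = f_X(trial)
            if f_trial < fx:              # improved mesh point found
                x, fx, improved = trial, f_trial, True
                break
        # Coarsen the mesh on success, refine it at a mesh local optimizer.
        delta = 2.0 * delta if improved else delta / 2.0
    return x, fx, evals


if __name__ == "__main__":
    def f(p):
        x, y = p
        return x ** 3 + y ** 3 - 10.0 * (x ** 2 + y ** 2)

    print(gps_poll_only(f, [-5.0, -5.0], [10.0, 10.0], [0.5, 0.5]))
```

The point reached and the evaluation count depend on the polling order and on the acceptance rule, so this sketch should not be expected to reproduce the figures reported for NOMADm.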

Table 1 shows the objective function value attained, the number of function evaluations, and the number of gradient evaluations for each problem. An asterisk appearing after the problem name indicates that the problem has at least one bound constraint (none had more general linear constraints). The seven poll strategies identified in the column headings of Table 1 differ in their choice of directions and are described as follows:

• Stand$_{2n}$: standard $2n$ directions, $D = [I, -I]$ with no gradient-pruning.

• Stand$_{n+1}$: standard $n + 1$ directions, $D = [I, -e]$ with no gradient-pruning, where $e$ is the vector of ones.

• Grad$_{2n}$: gradient-pruned subset of the standard $2n$ directions.

• Grad$_{n+1}$: gradient-pruned subset of the standard $n + 1$ directions.

• Grad$_{3^n}^{(1)}$: gradient-pruned subset of $\mathbb{D} = \{-1, 0, 1\}^n$, pruned by $d^{(1)}$.

• Grad$_{3^n}^{(2)}$: gradient-pruned subset of $\mathbb{D} = \{-1, 0, 1\}^n$, pruned by $d^{(2)}$.

• Grad$_{3^n}^{(\infty)}$: gradient-pruned subset of $\mathbb{D} = \{-1, 0, 1\}^n$, pruned by $d^{(\infty)}$.

• Grad$_{3^n,2n}^{(2)}$: the union of the sets of Grad$_{3^n}^{(2)}$ and Grad$_{2n}$, explored in that order.

Although no search step was employed in our test runs (because we wanted to have a fair comparison of POLL steps), we include the following simple example to illustrate the value of even a very basic search step.

Example 6.1 We ran NOMADm on the problem

$$\min_{x, y} \; f(x, y) = x^3 + y^3 - 10(x^2 + y^2) \quad \text{s.t.} \quad -5 \leq x \leq 10, \; -5 \leq y \leq 10$$

with an initial point of $x_0 = (0.5, 0.5)$. The results were:

• Stand$_{2n}$ found $x = (6.666, -5)$ with $f(x) = -523$ in 94 evaluations of $f$ and no evaluations of $\nabla f$.

• Grad$_{3^n}$ found $x = (6.666, 6.666)$ with $f(x) = -296$ in 33 evaluations of $f$ and 17 evaluations of $\nabla f$.

• Grad$_{3^n}$ with only a two point feasible random search found $x = (-5, -5)$ with $f(x) = -750$ in 85 evaluations of $f$ and 5 evaluations of $\nabla f$.

Thus, Stand$_{2n}$ finds a good local minimizer, and Grad$_{3^n}$ without a search finds a higher local minimum, but requires fewer function evaluations to do so. But, when a 2-point random search is added to each iteration of Grad$_{3^n}$, the global minimum is found, despite requiring fewer function evaluations than Stand$_{2n}$.
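
The three objective values reported in Example 6.1 can be checked by direct evaluation. The short Python sketch below assumes that the reported coordinate 6.666 is the truncated interior stationary point $t = 20/3$ of $t^3 - 10t^2$ (the nonzero root of $3t^2 - 20t = 0$); the helper name `f` is used only for this illustration.

```python
def f(x, y):
    """Objective of Example 6.1."""
    return x ** 3 + y ** 3 - 10.0 * (x ** 2 + y ** 2)


if __name__ == "__main__":
    t = 20.0 / 3.0  # interior stationary point of t^3 - 10 t^2 (about 6.666)
    for point in ((t, -5.0), (t, t), (-5.0, -5.0)):
        # Print each reported point with its objective value.
        print(point, f(*point))
```

Since the objective is separable, each coordinate can independently settle at either $20/3$ or the lower bound $-5$, which accounts for the three distinct values found by the three runs.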

For the CUTE problem results given in Table 1, numbers appearing in parentheses indicate that a second set of runs was performed for that problem, but with a slightly perturbed initial point. In these cases, if a particular entry contains no number in parentheses, then that value remained unchanged in the second run. The initial point was perturbed whenever the number of required function evaluations for a run was very small. This occurs when the choices of initial point and poll directions quickly drive the algorithm to an exact stationary point (i.e., $\nabla f(x_k) = 0$). Since this phenomenon is not common in practice, we didn't want to make a particular strategy appear gratuitously advantageous over another.

It should be noted that three problems, ALLINIT, ALLINITU, and EXPFIT, do not have starting points associated with them. Since ALLINIT is the constrained version of ALLINITU, we chose the same reasonable and feasible initial point for both problems, while the vector of ones was chosen for EXPFIT. Also, the problem OSLBQP has an infeasible starting point with respect to one of its lower bounds. Our NOMADm software automatically restores any such variable to its nearest bound before proceeding with the algorithm. (In the case of general linear constraints, our software shifts an infeasible starting point to the nearest feasible point with respect to the Euclidean norm.)

Table 1: Numerical results for selected CUTE test problems.


ALLINIT*,  n  =  4 

Stand2„ 

Stand„+i 

Grad2n 

Gradn+i 

Grad  3^ 

Gradgn^ 

Gradgi^ 

Gl'ad3",2n 

Optimal  /-value 

26.7643 

26.7643 

26.7643 

26.7643 

26.7643 

26.7643 

26.7643 

26.7643 

Function  evaluations 

85 

85 

47 

47 

47 

56 

47 

56 

Gradient  evaluations 

0 

0 

10 

10 

10 

10 

10 

10 

ALLINITU,  n  =  4 

Stand2n 

Standn-|_i 

Grad2n 

Gradn_|_i 

Grad^ 

GradG) 

Gradg^ 

Grad3™,2n 

Optimal  /-value 

6.9288 

5.7444 

6.9288 

5.7661 

5.8006 

5.7812 

9.2523 

6.9293 

Function  evaluations 

1628 

425 

846 

133 

30 

65 

25 

650 

Gradient  evaluations 

0 

0 

140 

20 

7 

14 

7 

65 

BARD,  n  —  3 

Stand2„ 

Stand„+i 

Grad2n 

Gradn+i 

GradSj™ 

Gradgn^ 

GradSj^ 

Grad3+2n 

Optimal  /-value 

0.0082 

0.0122 

0.0082 

0.0089 

0.0089 

0.0083 

0.0091 

0.0082 

Function  evaluations 

11061 

50000+ 

5963 

50000+ 

2430 

7799 

1263 

4871 

Gradient  evaluations 

0 

0 

1640 

16384 

1018 

2474 

636 

722 

BOX2,  n  =  3 

Stand2n 

Standn+i 

Grad2n 

Gradn+i 

Grad^ 

GradG) 

Gradg^ 

Gl'ad3",2n 

Optimal  /-value 

0 

0 

0 

0 

0 

0 

0 

0 

Function  evaluations 

61 

48 

31 

32 

25 

25 

25 

56 

Gradient  evaluations 

0 

0 

2 

2 

8 

8 

8 

4 

BOX3,  n  =  3 

Stand2„ 

Standn+i 

Grad2n 

Gradn+i 

Grad^ 

Gradg^ 

Gradij^ 

Grad3+2n 

Optimal  /-value 

0 

0 

0 

0 

0.5656 

0.2253 

0.5656 

0 

Function  evaluations 

91 

63 

46 

32 

17 

32 

17 

62 

Gradient  evaluations 

0 

0 

2 

2 

2 

2 

2 

2 

DENSCHNA.  n  =  2 

Stand2n 

Stand„4.i 

Grad2n 

Gradn+i 

Grad^ 

GradG) 

Gradg^ 

Gl'ad3",2n 

Optimal  /-value 

0 

0 

0 

0 

0 

0 

0 

0 

Function  evaluations 
Gradient  evaluations 

73(148) 

0 

47(156) 

0 

67(86) 

4(22) 

47(92) 

2(30) 

5(56) 

3(20) 

2(23) 

2(8) 

2(23) 

2(8) 

2(83) 

2(14) 

DENSCHNB.  n  =  2 

Stand2„ 

Stand„4-i 

Grad2n 

Gradn+i 

GradSj™ 

Grad^? 

GradSj^ 

Gl'ad3+2n 

Optimal  /-value 

0 

0 

0 

0 

0 

0 

0 

0 

Function  evaluations 
Gradient  evaluations 

68(119) 

0 

130(95) 

0 

68(66) 

3(15) 

87(53) 

29(16) 

6(50) 

3(16) 

4(28) 

3(11) 

4(27) 

3(10) 

4(71) 

3(10) 

DENSCHNC, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              0                 75(119)                0
Stand_{n+1}             0(0.0001)         54(662)                0
Grad_{2n}               0                 67(68)                 5(15)
Grad_{n+1}              0(0.001)          50(384)                4(168)
Grad^{(1)}_{3^n}        0                 7(87)                  4(31)
Grad^{(2)}_{3^n}        0                 5(103)                 3(33)
Grad^{(\infty)}_{3^n}   0                 4(62)                  3(28)
Grad^{(2)}_{3^n,2n}     0                 16(98)                 5(16)

EXPFIT, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              0.2405            300                    0
Stand_{n+1}             0.2406            999                    0
Grad_{2n}               0.2405            191                    66
Grad_{n+1}              0.2406            524                    251
Grad^{(1)}_{3^n}        0.2405            164                    66
Grad^{(2)}_{3^n}        0.2405            163                    69
Grad^{(\infty)}_{3^n}   0.2405            81                     37
Grad^{(2)}_{3^n,2n}     0.2405            198                    47

MARATOSB, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              -1                66                     0
Stand_{n+1}             0                 50                     0
Grad_{2n}               -0.0041           42                     3
Grad_{n+1}              0                 34                     3
Grad^{(1)}_{3^n}        -1                17                     2
Grad^{(2)}_{3^n}        -1                17                     2
Grad^{(\infty)}_{3^n}   -1                17                     2
Grad^{(2)}_{3^n,2n}     -1                17                     2

MDHOLE*, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              0                 37610                  0
Stand_{n+1}             0                 48                     0
Grad_{2n}               8130.9            31                     2
Grad_{n+1}              0.0196            5043                   1516
Grad^{(1)}_{3^n}        11127.6           15                     1
Grad^{(2)}_{3^n}        8130.9            31                     2
Grad^{(\infty)}_{3^n}   11127.6           15                     1
Grad^{(2)}_{3^n,2n}     8130.9            46                     2

MEXHAT, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              -0.0401           350                    0
Stand_{n+1}             -0.0393           69                     0
Grad_{2n}               -0.0401           203                    54
Grad_{n+1}              -0.0401           31                     7
Grad^{(1)}_{3^n}        -0.0401           32                     4
Grad^{(2)}_{3^n}        -0.0401           35                     4
Grad^{(\infty)}_{3^n}   -0.0401           20                     4
Grad^{(2)}_{3^n,2n}     -0.0401           285                    53

MEYER3, n = 3

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              146888            15102                  0
Stand_{n+1}             374774            18926                  0
Grad_{2n}               146888            8561                   2206
Grad_{n+1}              146888            5207                   1644
Grad^{(1)}_{3^n}        1947047           2214                   1099
Grad^{(2)}_{3^n}        302439            5252                   1903
Grad^{(\infty)}_{3^n}   1919139           2266                   1128
Grad^{(2)}_{3^n,2n}     152238            12596                  2527

OSBORNEA, n = 5

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              0.00106           2263                   0
Stand_{n+1}             0.00029           13300                  0
Grad_{2n}               0.00106           1222                   183
Grad_{n+1}              0.00029           7783                   2042
Grad^{(1)}_{3^n}        0.00185           496                    286
Grad^{(2)}_{3^n}        0.00015           1455                   528
Grad^{(\infty)}_{3^n}   0.00089           542                    314
Grad^{(2)}_{3^n,2n}     0.00012           1770                   237

OSBORNEB, n = 11

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              0.04014           23191                  0
Stand_{n+1}             0.15192           50000+                 0
Grad_{2n}               0.04014           11755                  736
Grad_{n+1}              0.12229           50000+                 6479
Grad^{(1)}_{3^n}        0.04131           1606                   777
Grad^{(2)}_{3^n}        0.04088           3866                   1245
Grad^{(\infty)}_{3^n}   0.0404            975                    481
Grad^{(2)}_{3^n,2n}     0.04014           6574                   355

OSLBQP*, n = 8

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              6.25              167                    0
Stand_{n+1}             6.25              1355                   0
Grad_{2n}               6.25              115                    7
Grad_{n+1}              6.25              472                    101
Grad^{(1)}_{3^n}        6.25              5                      5
Grad^{(2)}_{3^n}        6.25              5                      5
Grad^{(\infty)}_{3^n}   6.25              5                      5
Grad^{(2)}_{3^n,2n}     6.25              8                      6

PALMER1*, n = 3

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              11760             1498                   0
Stand_{n+1}             21166             438                    0
Grad_{2n}               11760             730                    160
Grad_{n+1}              21166             208                    64
Grad^{(1)}_{3^n}        43904             31                     11
Grad^{(2)}_{3^n}        43904             30                     11
Grad^{(\infty)}_{3^n}   43904             31                     13
Grad^{(2)}_{3^n,2n}     11760             1013                   161

PALMER1A*, n = 4

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              116.6             50000+                 0
Stand_{n+1}             7965.8            50000+                 0
Grad_{2n}               41.3              50000+                 5636
Grad_{n+1}              393.8             50000+                 9213
Grad^{(1)}_{3^n}        2262.7            50000+                 20128
Grad^{(2)}_{3^n}        4989.1            50000+                 17752
Grad^{(\infty)}_{3^n}   729.3             50000+                 24999
Grad^{(2)}_{3^n,2n}     45.9              50000+                 4360

PALMER1B*, n = 2

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              4.0137            50000+                 0
Stand_{n+1}             9010.0            50000+                 0
Grad_{2n}               3.4682            43899                  7684
Grad_{n+1}              6.3366            50000+                 11450
Grad^{(1)}_{3^n}        9080.6            50000+                 20441
Grad^{(2)}_{3^n}        13474.1           50000+                 25001
Grad^{(\infty)}_{3^n}   5135.9            50000+                 25001
Grad^{(2)}_{3^n,2n}     3.4671            44127                  5149

PALMER1C, n = 8

Method                  Optimal f-value   Function evaluations   Gradient evaluations
Stand_{2n}              16948398          50000+                 0
Stand_{n+1}             30078             36077                  0
Grad_{2n}               12514913          50000+                 5535
Grad_{n+1}              30078             5483                   2491
Grad^{(1)}_{3^n}        39201             2752                   1033
Grad^{(2)}_{3^n}        39859             3788                   1140
Grad^{(\infty)}_{3^n}   41284             252                    120
Grad^{(2)}_{3^n,2n}     799100            50000+                 4230

In general, the gradient-pruned GPS POLL-only methods converged in fewer function
evaluations than the standard GPS POLL-only methods, but they sometimes terminated at a
point with a higher function value, particularly when only one gradient-pruned descent
direction was used in the POLL step. In fact, in no case did a standard method require fewer
function evaluations than a "3^n" gradient method, and in only one case (MDHOLE, Stand_{n+1})
did a standard method require fewer function evaluations than its gradient-pruned counterpart
with the same poll directions (e.g., Stand_{2n} versus Grad_{2n}, Stand_{n+1} versus
Grad_{n+1}). Also, Grad^{(2)}_{3^n,2n} was always at least as fast as Stand_{2n}.

However, the problems ALLINITU, MARATOSB, MEXHAT, and OSBORNEA are examples of cases
where a gradient-pruned method converged to a lower point than a standard poll. In fact,
Grad^{(2)}_{3^n} (with only one function evaluation per iteration) achieved a lower
function value for OSBORNEA than both standard poll methods despite using fewer function
evaluations.

The problem PALMER1 is an interesting example with unflattering behavior. The three
highly pruned Grad_{3^n} strategies converge the quickest, but to a much poorer solution
(43904) than the other methods. The methods that use the most directions, Stand_{2n},
Grad_{2n}, and Grad^{(2)}_{3^n,2n}, converge to the best solution of 11760, with Grad_{2n}
saving a considerable number of function evaluations over Stand_{2n}. The two n + 1
direction methods converge to a solution of 21166, with gradient pruning again saving a
significant number of function evaluations. Finally, the Grad^{(2)}_{3^n,2n} strategy
reaches the best value faster than Stand_{2n}, but it was not able to match Grad_{2n} in
function evaluations, perhaps because it requires slightly more function evaluations per
POLL step.


7  Discussion 


We have presented a GPS algorithm that uses any available derivative information to reduce
the size of the poll set. Numerical results with a GPS POLL-only method applied to problems
from the CUTE test collection suggest that using gradient information in the POLL step
often reduces the number of function evaluations needed to terminate, compared with a
standard POLL-only GPS algorithm that does not use gradient information. We emphasize that
we envision this new POLL step being used with a SEARCH step with a global reach, as
explained in Section 3.1.
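
As a minimal sketch of how derivative information can shrink a poll set, the Python
snippet below prunes the 2n coordinate directions [I, -I] with a gradient g, keeping only
the directions d for which g . d <= 0. The pruning rule as written, the helper names
coordinate_directions and prune_poll_set, and the sample gradient are assumptions made for
illustration; they are not taken from the algorithm statement in this paper.

```python
from typing import List, Sequence


def coordinate_directions(n: int) -> List[List[int]]:
    """Return the 2n directions [I, -I]: e_1..e_n followed by -e_1..-e_n."""
    dirs = []
    for sign in (1, -1):
        for i in range(n):
            d = [0] * n
            d[i] = sign
            dirs.append(d)
    return dirs


def prune_poll_set(directions: Sequence[Sequence[float]],
                   gradient: Sequence[float]) -> List[List[float]]:
    """Keep the directions d with gradient . d <= 0 (assumed pruning rule)."""
    kept = []
    for d in directions:
        slope = sum(g * di for g, di in zip(gradient, d))
        if slope <= 0:
            kept.append(list(d))
    return kept


D = coordinate_directions(3)
g = [2.0, -1.0, 0.5]           # hypothetical gradient at the current iterate
pruned = prune_poll_set(D, g)
print(len(D), len(pruned))     # size of the poll set before and after pruning
```

Under this rule, a gradient with no zero component reduces the 2n coordinate directions
to n, so a POLL step in this setting needs at most half as many function evaluations.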


The new approach does not necessarily save computational time if computing a gradient is
expensive, for example when a gradient costs n times as much as an evaluation of the
objective function, as with a finite-difference gradient. Moreover, the two approaches
often converge to different minimizers with different objective function values.
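
To make this trade-off concrete, the sketch below applies a rough cost model in which each
gradient evaluation is charged as n function evaluations. The cost model and the helper
name equivalent_cost are assumptions for illustration only; the counts are the Stand_{2n}
and Grad_{2n} entries reported above for BARD (n = 3) and OSLBQP (n = 8).

```python
def equivalent_cost(f_evals: int, g_evals: int, n: int) -> int:
    """Total cost in function-evaluation units, assuming one gradient costs n evaluations."""
    return f_evals + n * g_evals


# (function evaluations, gradient evaluations) copied from the results tables
runs = {
    "BARD": {"n": 3, "Stand_2n": (11061, 0), "Grad_2n": (5963, 1640)},
    "OSLBQP": {"n": 8, "Stand_2n": (167, 0), "Grad_2n": (115, 7)},
}

for name, run in runs.items():
    n = run["n"]
    stand = equivalent_cost(*run["Stand_2n"], n)
    grad = equivalent_cost(*run["Grad_2n"], n)
    print(name, stand, grad)
```

Under this assumed model, the advantage of Grad_{2n} on BARD shrinks from 5098 fewer
function evaluations to 178 fewer equivalent evaluations (10883 versus 11061), and on
OSLBQP the gradient-pruned run becomes slightly more expensive (171 versus 167).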